Feature Specification: Cloud Snapshot Demo Lifecycle

Feature Branch: 008-cloud-snapshot-lifecycle
Created: 2026-02-27
Status: Draft
Input: Add snapshot-based cloud demo lifecycle management to the existing Hetzner Cloud infrastructure, enabling near-instant demo readiness by restoring from pre-built snapshots instead of provisioning from scratch

User Scenarios & Testing (mandatory)

User Story 1 - Warm Start a Demo Cluster from Snapshots (Priority: P1)

As a presenter preparing for a stakeholder meeting, I need to bring up a fully provisioned CUI demo cluster in under 5 minutes so that I can demonstrate compliance capabilities without a 25-minute cold-start delay.

Why this priority: This is the core value proposition. The entire feature exists to eliminate the provisioning bottleneck that prevents practical demos. Without fast cluster restore, no other functionality in this feature matters.

Independent Test: Can be fully tested by having a snapshot set available (from a previous cold build) and running the warm-start command. Verify all 4 VMs are accessible, all services are running, and demo scenarios execute successfully.

Acceptance Scenarios:

  1. Given a snapshot set exists from a previous cluster build, When I run the warm-start command, Then 4 VMs are created from snapshots with the same server types as the original cluster (mgmt01: cpx21, others: cpx11)
  2. Given VMs are created from snapshots, When the warm-start completes, Then all nodes are attached to a private network with the same IP assignments (mgmt01: 10.0.0.10, login01: 10.0.0.20, compute01: 10.0.0.31, compute02: 10.0.0.32)
  3. Given the cluster is restored, When I check service health, Then FreeIPA, Slurm, Wazuh, NFS, Munge, and chronyd are all running on their respective nodes
  4. Given the cluster is restored, When I run any existing demo scenario (A, B, C, or D), Then the scenario executes identically to a cold-provisioned cluster
  5. Given no snapshot set exists, When I run the warm-start command, Then I see a clear message directing me to build a cluster first and create snapshots
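The topology and the missing-snapshot case in the scenarios above can be sketched as a small data model. This is a minimal sketch under stated assumptions, not the project's implementation: `NodeSpec`, `CLUSTER`, and `plan_warm_start` are hypothetical names, and a snapshot set is modeled as a plain mapping from node name to snapshot identifier. Only the node names, server types, and private IPs come from this spec. The "incomplete set" branch is an added assumption that the scenarios do not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    name: str
    server_type: str
    private_ip: str


# Node names, server types, and private IPs are taken from the acceptance scenarios.
CLUSTER = (
    NodeSpec("mgmt01", "cpx21", "10.0.0.10"),
    NodeSpec("login01", "cpx11", "10.0.0.20"),
    NodeSpec("compute01", "cpx11", "10.0.0.31"),
    NodeSpec("compute02", "cpx11", "10.0.0.32"),
)


def plan_warm_start(snapshot_set):
    """Return (plan, error).

    snapshot_set maps node name -> snapshot identifier (hypothetical shape).
    The plan lists what would be created; it does not call any cloud API.
    """
    if not snapshot_set:
        return [], "No snapshot set found. Build a cluster first, then create snapshots."
    # Assumption: a set that does not cover every node is rejected outright.
    missing = [n.name for n in CLUSTER if n.name not in snapshot_set]
    if missing:
        return [], "Snapshot set is incomplete; missing: " + ", ".join(missing)
    plan = [
        {
            "name": n.name,
            "server_type": n.server_type,
            "private_ip": n.private_ip,
            "snapshot_id": snapshot_set[n.name],
        }
        for n in CLUSTER
    ]
    return plan, None
```

Keeping the plan separate from the API calls means the server-type and IP assignments can be checked without creating any billable resources.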

User Story 2 - Create Snapshot Set from Running Cluster (Priority: P1)

As a presenter who has just completed a successful cold-build provisioning, I need to snapshot the entire cluster so that future demos can start in minutes instead of waiting for full provisioning.

Why this priority: Equal in priority to warm-start. Without the ability to create snapshots, there is nothing to restore from. This is the "plant the seed" step that enables all future fast starts.

Independent Test: Can be fully tested by running demo-cloud-up.sh to completion, then creating snapshots, and verifying the snapshot set is listed and contains metadata for all 4 VMs.

Acceptance Scenarios:

  1. Given a running, fully provisioned cluster, When I run the snapshot command, Then all 4 VMs are snapshotted via the cloud API
  2. Given snapshots are being created, When the process runs, Then I see progress output showing each VM being snapshotted with its name and status
  3. Given snapshots are complete, When I list available snapshot sets, Then I see the new set with creation date, VM names, and snapshot identifiers
  4. Given a successful cold-build via demo-cloud-up.sh, When provisioning completes, Then I am prompted with the option to snapshot the cluster for future fast starts
  5. Given I have multiple snapshot sets, When I list them, Then they are displayed chronologically with identifying labels
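One way to picture the listing described in scenarios 3 and 5 is a record per snapshot set plus a formatter that sorts by creation time. This is a sketch only: the `SnapshotSet` class, its fields, and the output layout are hypothetical, and the snapshot identifiers in any example data are placeholders rather than real cloud IDs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SnapshotSet:
    label: str        # identifying label for the set
    created: datetime  # creation date of the set
    snapshots: dict    # VM name -> snapshot identifier


def format_snapshot_sets(sets):
    """Return one display line per snapshot set, oldest first."""
    lines = []
    for s in sorted(sets, key=lambda item: item.created):
        vms = ", ".join(f"{vm}={sid}" for vm, sid in sorted(s.snapshots.items()))
        lines.append(f"{s.created:%Y-%m-%d %H:%M}  {s.label}  [{vms}]")
    return lines
```

Oldest-first ordering is an assumption; the spec only requires that sets be displayed chronologically.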

User Story 3 - Health Check a Running Cluster (Priority: P1)

As a presenter about to start a demo, I need to verify that all critical services are operational so that I can confidently begin my presentation without surprises.

Why this priority: A restored cluster is only useful if services actually came back correctly. The health check is the trust layer that confirms readiness. It runs automatically during warm-start but must also be available independently.

Independent Test: Can be tested by running the health check against any running cluster (cold-built or restored) and verifying it produces a clear pass/fail summary.

Acceptance Scenarios:

  1. Given a running cluster, When I run the health check, Then I see a summary table showing pass/fail status for each service on each node
  2. Given all services are healthy, When the health check completes, Then it exits with code 0 and displays an all-clear message
  3. Given one or more services are down, When the health check completes, Then it exits with a non-zero code and clearly identifies which services on which nodes have failed
  4. Given a warm-start has just completed, When the warm-start process finishes, Then the health check runs automatically as a final verification step
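The pass/fail summary and exit-code behavior in scenarios 1 through 3 can be illustrated with a small aggregator. This is a hedged sketch: `summarize_health` is a hypothetical function, the table layout is illustrative, and the probing of each service is left out entirely. The spec does not say which node hosts which service, so results are passed in as `(node, service, ok)` tuples rather than derived from a fixed mapping.

```python
def summarize_health(results):
    """Build a summary table and exit code from (node, service, ok) tuples.

    Returns (report, exit_code): exit_code is 0 when every check passed,
    1 otherwise. How each check is performed is outside this sketch.
    """
    lines = [f"{'NODE':<12}{'SERVICE':<12}STATUS"]
    failures = []
    for node, service, ok in sorted(results):
        lines.append(f"{node:<12}{service:<12}{'PASS' if ok else 'FAIL'}")
        if not ok:
            failures.append(f"{service} on {node}")
    if failures:
        lines.append("FAILED: " + ", ".join(failures))
        return "\n".join(lines), 1
    lines.append("All services healthy.")
    return "\n".join(lines), 0
```

A command-line wrapper could print the report and pass the second value to `sys.exit`, which lets the warm-start process reuse the same function as its final verification step.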

User Story 4 - Graceful Session Wind-Down (Priority: P2)

As a presenter who has finished a demo session, I need to cleanly shut down the cluster with the option to preserve current state before teardown so that demo artifacts are not lost and billing stops promptly.

Why this priority: Important for cost management and data preservation, but secondary to the core warm-start/snapshot workflow. Users can always use the existing demo-cloud-down.sh as a fallback.

Independent Test: Can be tested by running the wind-down command on a running cluster, optionally choosing to snapshot first, and verifying all resources are destroyed and cost summary is displayed.

Acceptance Scenarios:

  1. Given a running cluster, When I run the wind-down command, Then I am asked whether to snapshot current state before teardown
  2. Given I choose to snapshot before teardown, When teardown proceeds, Then a snapshot set is created before resources are destroyed
  3. Given I choose not to snapshot, When teardown proceeds, Then resources are destroyed without a snapshot once I confirm the teardown
  4. Given teardown completes, When the process finishes, Then I see session duration and estimated cost for the session
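The duration and cost summary in scenario 4 amounts to simple arithmetic over the session window. The sketch below assumes linear per-hour pricing with no rounding to billing increments, which may not match how the provider actually bills. `session_summary` is a hypothetical name, and hourly rates are supplied by the caller because this spec does not state any prices or currency.

```python
from datetime import datetime


def session_summary(started, ended, hourly_rates):
    """Format session duration and an estimated cost.

    hourly_rates maps node name -> price per hour (caller-supplied;
    no real prices are assumed here).
    """
    seconds = (ended - started).total_seconds()
    if seconds < 0:
        raise ValueError("session end precedes session start")
    # Assumption: cost scales linearly with elapsed time across all nodes.
    cost = (seconds / 3600) * sum(hourly_rates.values())
    minutes = int(seconds // 60)
    return (
        f"Session duration: {minutes // 60}h {minutes % 60:02d}m, "
        f"estimated cost: {cost:.2f}"
    )
```

The result is labeled as an estimate because snapshot storage and any provider-side rounding are not included.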

User Story 5 - Manage Snapshot Sets (Priority: P2)

As a user managing cloud costs, I need to list and delete old snapshot sets so that I do not accumulate storage charges for outdated snapshots.

Why this priority: Housekeeping capability that prevents cost creep. Not needed for initial demo workflows but becomes important over time.

Independent Test: Can be tested by creating multiple snapshot sets, listing them, deleting one, and verifying it no longer appears in the list.

Acceptance Scenarios:

  1. Given multiple snapshot sets exist, When I list them, Then I see each set with creation date, label, and number of snapshots
  2. Given I identify an old snapshot set, When I delete it, Then all snapshots in the set are removed from the cloud provider
  3. Given I delete a snapshot set, When I list remaining sets, Then the deleted set no longer appears

Edge Cases

Requirements (mandatory)

Functional Requirements

Snapshot Creation

Snapshot Restore (Warm Start)

Health Check

Session Wind-Down

Snapshot Management

Integration

Key Entities

Success Criteria (mandatory)

Measurable Outcomes

Scope

In Scope

Out of Scope

Assumptions

Dependencies

Clarifications

Session 2026-02-27